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Abstract 

We  review  the  progress  made  in  computational  vision,  as  represented  by  Marr’s  ap¬ 
proach,  in  the  last  fifteen  years.  First,  we  briefly  outline  computational  theories  developed 
for  low,  middle  and  high-level  vision.  We  then  discuss  in  more  detail  solutions  proposed 
to  three  representative  problems  in  vision,  each  dealing  with  a  different  level  of  visual 
processing.  Finally,  we  discuss  modifications  to  the  currently  established  computational 
paradigm  that  appear  to  be  dictated  by  the  recent  developments  in  vision. 
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1  Vision  as  an  information-processing  task 

Contemporary  cognitive  science  construes  mental  processes  as  computations  defined  over  rep¬ 
resentations  [126].  Under  this  view,  visual  perception,  in  particular,  is  defined  as  the  process 
whereby  light  patterns  impinging  on  the  retina  are  transformed  into  internal  representations. 
The  problem  of  interpreting  a  retinal  input  is  illustrated  in  figure  1,  showing  a  photograph 
of  a  natural  scene  and  an  array  of  the  intensity  values  corresponding  to  a  part  of  it.  These 
intensities  are  the  raw  data  of  vision,  as  they  arrive  in  the  camera  or  at  the  retina.  To 
appreciate  the  difficulty  of  visual  recognition,  one  may  try  to  identify  the  object  (present  in 
the  scene)  described  by  this  array. 

Computational  vision  sets  out  to  understand  theoretical  aspects  of  visual  perception,  to 
explicate  possible  strategies  of  solving  the  emerging  problems,  and  to  explore  their  biological 
and  artificial  implementations.  This  paradigm  emphasizes  the  need  for  an  interdisciplinary 
approach:  machine  vision  can  serve  as  a  testbed  for  psychological  and  neural  models  of  visual 
function,  while  biological  vision  can  supply  useful  hints  in  choosing  feasible  approaches  to 
problems  in  visual  processing  by  providing  examples  of  workable  solutions. 

The  problem  of  understanding  visual  perception  in  information  processing  terms  has  been 
formulated  most  cogently  by  Marr  and  Poggio  ([92],  [96],  [93]),  who  argued  that  vision  (and, 
indeed,  any  information-processing  task)  may  be  studied  at  three  distinct  levels.  At  the  top 
level  they  placed  the  computational  theory  of  vision,  in  which  the  problem  is  characterized 
as  a  mapping  from  one  kind  of  information  to  another,  and  the  abstract  properties  of  this 
mapping  are  analyzed.  At  the  middle  level,  there  is  the  choice  of  algorithm  for  the  mapping, 
determined,  among  other  factors,  by  the  nature  of  representation  at  its  input  and  output. 
At  the  lowest  level,  there  are  the  implementational  details,  such  as  the  physical  realization 
of  the  representation  and  the  algorithm. 

In  the  original  formulation,  the  three  levels  were  only  loosely  coupled.  Nevertheless, 
it  appears  that  any  commitment  to  a  specific  computational  formulation  of  vision  would 
inevitably  affect  the  algorithm  level.  In  particular,  as  we  shall  argue,  Marr’s  definition  of 
the  computational  purpose  of  vision  have  influenced  the  algorithm-level  theories  of  an  entire 
school  of  vision  research.  Marr  held  that  the  representation  of  the  shape  of  an  object  is  quite 
different  from  the  representation  of  its  use  and  purpose,  find  that  vision  alone  can  deliver 
an  internal  description  of  the  shape  of  an  object,  even  when  the  object  is  not  recognized  in 
the  sense  of  understanding  its  use  and  purpose  ([93],  p.35).  Together  with  the  assumption 
that  a  good  way  to  represent  an  object  is  to  build  a  three-dimensional  model  of  it,  this 
view  resulted  in  many  current  theories  of  visual  recognition  relying  on  the  detection  of  3D 
primitives  [14],  or  requiring  a  library  of  3D  object  models  ([87],  [156]).  In  biological  visual 
systems,  in  comparison,  the  question  of  representation  appears  to  be  far  from  settled.  As  the 
retinal  image  is  transformed  with  each  successive  processing  stage,  the  theories  concerning 
the  nature  of  representation  diverge  more  and  more  widely,  reaching  no  consensus  as  to  what 
is  the  ultimate  visual  representation  of  objects,  scenes  and  events  ([140],  [130]). 

Solutions  to  many  visual  tasks  appear  to  be  inherently  ambigu  ms.  For  example,  assum¬ 
ing,  as  Marr  seems  to  have  done,  that  the  purpose  of  vision  is  to  reconstruct  the  spatial  layout 
of  the  outside  world,  one  is  confronted  with  the  difficult  problem  of  ambiguity  of  inferring 
the  third  dimension  from  retinal  projection.  This  ambiguity  is  a  necessary  consequence  of 
the  imaging  process,  during  which  depth  information  is  lost.  Marr  proposed  to  compensate 
for  this  loss  by  constraining  the  solution  to  the  reconstruction  problem  to  conform  to  a  priori 
assumptions,  dictated  by  our  knowledge  of  the  physical  world.  The  research  program  based 
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Figure  1:  Top:  n  photograph  of  a  natural  scone,  with  two  antelopes.  Bottom:  the  array  of 
the  intensity  values  corresponding  to  the  image  of  the  antelope  in  the  right  part  of  the  scene. 
Those  intensities  are  the  raw  data  of  vision,  as  they  arrive  in  the  camera  or  at  the  retina. 
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on  this  approach  advanced  our  understanding  of  different  aspects  of  vision,  such  as  perceptual 
grouping,  stereopsis  and  motion  perception. 

In  areas  in  which  physical  constraints  are  hard  to  define,  as  in  object  recognition  and 
scene  understanding,  the  reconstructionist  research  program  runs  into  difficulties.  Recently 
implemented  recognition  methods  ([87],  [156],  [145],  [49],  [70])  appear  to  circumvent  the 
problem  of  3D  reconstruction,  either  by  addressing  other  important  issues  in  recognition, 
such  as  viewpoint  normalization,  while  assuming  that  the  3D  models  of  objects  are  built  into 
the  system,  or  by  carrying  out  the  entire  recognition  process  in  2D. 

In  this  chapter,  we  review  the  progress  made  in  computational  vision,  as  represented 
by  Marr’s  approach,  in  the  last  fifteen  years.  Section  2  outlines  computational  theories 
developed  for  low,  middle  and  high-level  vision.  Section  3  contains  a  more  detailed  discussion 
of  solutions  proposed  to  three  representative  problems  in  vision,  each  dealing  with  a  different 
level  of  visual  processing.  The  last  section  suggests  modifications  to  the  currently  established 
computational  paradigm  that  appear  to  be  dictated  by  the  recent  developments  in  vision. 

Since  Marr’s  influence  on  vision  research  has  been  the  strongest  at  MIT,  we  follow  in 
our  review  the  development  of  the  MIT  perspective  on  computational  vision.  Other  recent 
reviews  of  computational  and  machine  vision  include  ([21],  [62],  [41],  [132]  and  [57]). 

2  The  components  of  vision 

The  analysis  of  images  can  be  done  at  different  levels.  Low-level  vision  undertakes  the 
reconstruction  of  the  visual  world,  but  only  in  a  limited  sense.  Borrowing  from  Man’s 
terminology,  its  end  product  is  a  kind  of  2|D  sketch  of  the  surrounding  scene:  a  representation 
in  which  many  attributes  of  the  visible  surfaces  are  made  explicit  and  used  as  labels  in  a 
scene-based  map.  Among  the  attributes  computed  at  this  stage  are  the  relative  motion 
and  depth  of  the  visible  features,  as  well  as  luminance,  color  and  texture  based  descriptions 
of  different  surfaces.  Information  from  the  different  visual  cues  may  be  further  processed 
by  middle-level  modules  that  compute  more  abstract  properties  defined  over  the  extracted 
features.  Combining  surface  patches  and  building  descriptions  of  objects  is  within  the  scope 
of  high-level  vision. 

The  rest  of  this  part  is  organized  as  follows:  in  sections  2.1,  2.2  and  2.3  we  review  research 
in  low-level,  middle  and  high-level  vision,  respectively,  and  in  section  2.4  we  review  some  of 
the  aspects  of  neuronal  modeling  in  the  context  of  vision. 

2.1  Low-level  vision 

One  possible  definition  of  the  domain  of  low-level  vision  is  through  the  notion  of  bottom-up, 
or  data-driven,  processing  (as  opposed  to  top-down,  or  model-driven  processing).  Low-level 
vision  is  concerned  with  those  transformations  of  the  input  image  that  are  common  to  most 
visual  tasks  and  can  be  carried  out  as  fast  as  the  input  changes,  given  the  resources  of  the 
system. 

To  some  extent,  the  analysis  of  the  different  visual  cues  can  be  done  independently  of 
each  other  and  in  parallel.  Different  behavioral  tasks  such  as  navigation,  recognition  or  high 
precision  manipulation  may  also  require  the  extraction  of  different  kinds  of  visual  information. 
It  is  not  clear,  though,  how  independent  the  analysis  of  the  different  low-level  modules  can 
and  should  be  and  to  what  extent  information  should  actually  be  integrated.  The  question  of 
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module  interdependence  may  be  approached  both  through  the  development  and  integration 
of  low-level  algorithms  and  by  the  study  of  biological  vision  systems. 

Evidence  from  human  visual  psychophysics  suggests  there  is  at  least  some  degree  of  in¬ 
dependence  among  the  modules.  For  example,  one  can  easily  understand  black  and  white 
movies  that  lack  color  and  stereo,  and  3D  structure  can  be  perceived  in  extremely  impov¬ 
erished  stimuli,  such  as  the  projections  of  rotating  wire-frame  objects  [162],  or  random-dot 
stereograms  [73].  Physiological  evidence  for  module  independence  is  provided  by  the  existence 
of  separate  subcortical  and  cortical  pathways  carrying  different  kinds  of  visual  information 
([142],  [102],  [174];  see  also  [77],  p.381).  This  independence  is  probably  not  complete:  single 
units  that  respond  to  several  different  cues,  such  as  stereo  disparity  and  direction  of  motion, 
are  found  in  some  visual  cortical  areas  ([35];  cf.  [27]). 

2.1.1  Stereopsis 

Since  the  image  on  the  retina  of  an  eye,  or  in  a  camera,  is  obtained  by  projecting  a  three- 
dimensional  world  onto  a  two-dimensional  surface,  the  information  on  the  third  dimension, 
depth,  is  lost.  However,  a  perception  of  depth  can  still  be  achieved  by  stereopsis,  i.e.,  by 
combining  information  from  two  or  more  images,  taken  simultaneously  from  slightly  different 
positions,  e.g.,  by  the  two  eyes.  In  human  vision,  binocular  stereo,  together  with  other  cues 
such  as  motion  and  perspective,  is  an  important  source  of  depth  information. 

The  importance  of  seeing  depth  in  manipulation  tasks  becomes  clear  if  one  attempts  to 
thread  a  needle,  or  to  insert  a  key  into  a  keyhole,  with  one  eye  closed.  The  importance 
of  stereo  in  recognition  tasks  seems  to  be  more  limited,  as  people  can  recognize  objects  in 
drawings  and  photographs  that  lack  binocular  information.  Nevertheless,  much  attention 
has  been  given  to  the  analysis  of  stereo  vision,  mainly  because  the  goals  of  stereo  seemed 
relatively  simple  and  well  understood,  and  because  stereo  has  been  demonstrated  by  Julesz 
to  create  a  perception  of  3D  shape  in  the  absence  of  any  other  cue  ([73];  see  figure  2). 
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Figure  2:  A  random  dot  stereogram.  The  left  and  the  right  images  should  be  viewed  by  the 
left  and  the  right  eyes,  respectively.  When  the  two  images  are  fused,  a  square  floating  in 
depth  above  the  background  appears  in  the  center.  This  is  an  example  of  shape  perception 
when  the  only  cue  is  stereopsis:  no  monocular  shape  cues  are  present  in  either  image. 

In  computer  vision  algorithms,  the  extraction  of  depth  from  binocular  stereo  begins  with 
the  formation  of  a  disparity  map  by  matching  the  two  images  (the  disparity  of  an  object  is 
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defined  as  the  difference  between  its  positions  in  the  two  images).  Thus,  a  disparity  value 
is  assigned  to  every  location  in  the  image.  The  representation  of  disparities  in  the  brain  is 
different:  cells  tuned  to  a  relatively  narrow  range  of  disparity  are  found  in  the  primary  visual 
cortex  along  with  cells  that  respond  selectively  to  features  whose  disparities  are  higher  or 
lower  than  certain  values  [118]. 

In  a  random  dot  stereogram,  e.g.  in  figure  2,  (and  in  many  natural  scenes)  matching 
objects  in  the  two  images  to  compute  disparity  is  difficult.  For  every  object  there  are  many 
potential  target  matches,  all  but  one  of  which  are  false.  Two  constraints  on  matching  were 
proposed  to  solve  the  false  targets  problem  [95].  The  first  constraint,  uniqueness,  forces  each 
object  in  one  image  to  be  matched  to  at  most  one  object  in  the  other  image.  The  second 
constraint,  smoothness,  requires  that  nearby  objects  in  one  image  be  matched  to  nearby 
objects  in  the  other  image.  Most  of  the  existing  stereo  algorithms,  as  well  as  most  of  the 
theories  of  human  stereo  vision,  incorporate  these  two  constraints. 

The  matching  of  stereo  pairs  of  complicated  scenes  remains  difficult  even  after  the  appli¬ 
cation  of  these  constraints.  For  example,  objects  may  look  different  in  the  two  images,  due  to 
different  perspective  distortions.  For  images  that  have  enough  texture,  e.g.,  in  random  dot 
stereograms,  matching  can  be  done  by  cross- correlation  of  regions  in  the  two  images  ([111], 
[171]).  Many  attempts  have  been  made  to  obtain  a  general  matching  algorithm  by  the  use 
of  more  sophisticated  matching  techniques  ([9]  is  a  review  of  early  algorithms;  see  also  [101], 
[59]).  Two  examples  are  multi- resolution  matching  (processing  the  images  at  several  levels 
of  detail;  [97],  [48],  [100])  and  adjusting  the  fixation  point  to  control  the  range  of  disparities 
in  the  images  [48].  After  intensive  development,  a  few  of  the  algorithms  appear  capable  of 
achieving  better  performance  on  a  single  stereo  pair  than  that  of  the  human  visual  system, 
albeit  at  the  cost  of  a  high  computational  complexity. 

Despite  the  impressive  performance  of  some  existing  stereo  algorithms,  problems  with  the 
basic  paradigm  of  applying  physical  constraints  arise  already  in  this  relatively  narrow  domain. 
Specifically,  imposing  the  constraints  of  uniqueness  and  smoothness  on  the  matching  process 
may  be  detrimental  in  some  situations.  For  example,  the  smoothness  assumption  breaks 
down  for  scenes  that  include  transparent  objects,  such  as  fences.  Looking  through  a  fence, 
one  may  see  a  tree  to  the  left  of  a  given  bar  in  the  left  eye  and  the  same  tree  to  the  right  of 
the  same  bar  in  the  right  eye.  (It  should  be  noted  that  simple  modifications  of  the  continuity 
constraint  can  overcome  this  problem  [124]).  A  more  serious  problem  with  smoothness  arises 
at  those  locations  in  the  images  where  there  are  depth  discontinuities.  Smoothing  across 
discontinuities  is  bound  to  produce  errors  in  stereo  (and  other  low-level  vision  algorithms 
that  employ  this  constraint,  such  as  shape  from  shading  and  motion;  see  sections  2.1.2,  2.1.4 
and  3.2).  Among  the  recent  studies  that  address  this  problem  are  [45],  [18]  and  [144].  See 
also  [122]  for  an  overview  of  the  use  of  smoothness  constraints  in  low-level  vision. 

There  is  also  evidence  that  human  vision  may  not  always  follow  the  uniqueness  assump¬ 
tion  and  that  people  may  perceive  simultaneously  several  surfaces  corresponding  to  multiple 
matchings  between  elements  in  the  two  images  [167].  In  some  cases  the  relationship  between 
disparities  derived  from  matching  and  perceived  depth  is  not  unequivocal.  For  example,  the 
depth  perceived  in  a  dot  pattern  may  correspond  to  sin  average  disparity,  rather  than  to  one 
of  the  disparities  derived  from  a  possible  matching  [103]. 

A  major  open  problem  in  stereo  is  what  the  representation  at  the  output  of  a  stereo 
system  should  be.  The  disparity  field,  once  computed  by  a  matching  algorithm,  can  be  used 
to  extract  relative  depth  information.  Objects  with  large  disparities  sire  usually  closer  to  the 
viewer  thsm  objects  with  smsdl  disparities  (when  the  opticsil  sixes  of  the  csimeras  are  sdmost 
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parallel).  If  the  location,  the  orientation  and  the  focal  length  of  both  camera*  (or  eye*)  are 
known,  a  simple  geometrical  transformation  can  be  used  to  compute  the  exact  depth  from  the 
disparities.  Typically,  however,  the  precise  camera  parameter*  are  not  known.  Computing 
the  parameters  from  the  disparity  field  is  possible,  though  computationally  expensive  and 
sensitive  to  noise  [62].  It  is  unclear  whether  this  computation  is  necessary  or  whether  relative 
depth  is  sufficient  for  all  practical  purposes  ([7],  [168]). 

3.1.2  Motion 

The  analysis  of  motion  is  similar  to  stereo  in  that  it  involves  the  analysis  of  a  set  of  images 
of  a  given  scene.  The  differences  are  that  the  images  are  taken  in  succession  rather  than 
simultaneously,  more  than  two  images  may  be  available  for  analysis,  and  the  objects  may 
undergo  an  arbitrary  transformation  between  the  images.  Specifically,  in  stereo  it  is  assumed 
that  the  two  images  are  taken  by  two  fixed  cameras  (eyes)  whose  relative  location  is  known, 
at  least  approximately.  When  images  are  taken  at  different  time*,  any  object  can  translate  in 
any  direction,  rotate  and  deform,  so  that  the  geometrical  transformation  between  the  images 
is  not  sufficiently  defined,  unless  some  assumptions  are  made.  Among  the  goals  of  motion 
analysis  are  the  extraction  of  depth  information  for  shape  perception  and  navigation  (possibly 
in  a  less  precise  way  than  stereo),  the  segmentation  of  an  image  into  distinct  objects,  and 
the  identification  of  moving  objects. 

Psychophysical  studies  indicate  that  motion  analysis  alone  is  sufficient  to  extract  shape 
information  when  the  moving  objects  are  rigid  bodies  (that  is,  the  distance  between  any 
two  points  on  an  object  remains  constant  with  time;  [162],  [152]).  The  rigidity  constraint  is 
useful  to  make  the  problem  computationally  tractable  ([152];  see  [153]  and  [56]  for  a  review). 
It  is  derived  from  a  computational  analysis  of  the  problem  and,  similar  to  the  constraints 
imposed  in  stereo  analysis,  it  should  be  applied  with  care.  There  are  many  examples  where 
humans  perceive  nonrigid  motion  (e.g.,  expansion  and  contraction),  even  though  a  rigid 
motion  interpretation  (such  as  a  rotation  in  depth)  is  possible  (see  [4],  [53]).  In  some  cases, 
the  motion  of  nonrigid  objects  leads  to  a  clear  perception  of  shape.  For  example,  watching 
a  person  walking  in  the  dark  with  small  lights  attached  to  the  joints,  one  can  perceive  a 
human  figure  walking  ([71],  [60]).  Although  several  recent  studies  address  the  issue  of  the 
interpretation  of  nonrigid  motion,  the  algorithms  they  develop  usually  allow  only  certain 
classes  of  nonrigid  transformations,  such  as  bending  (see,  e.g.,  [81]  and  [29]). 

Similar  to  the  analysis  of  stereo,  the  analysis  of  motion  can  be  divided  into  two  stages:  the 
matching,  either  of  features,  or  of  regions,  in  the  successive  images,  followed  by  a  computation 
that  relies  on  the  resulting  disparity  information.  Another  approach  involves  the  computation 
of  optical  flow  (the  instantaneous  2D  velocity  field)  instead  of  the  motion  disparity  map. 
This  computation  can  be  presumably  done  by  local  intensity  change  detectors  (e.g.  [172]).  A 
principal  difficulty  in  the  first  approach  is  matching,  whereas  in  the  optical  flow  formulation 
it  is  relatively  high  sensitivity  to  noise,  associated  with  numerical  differentiation  [62].  The 
distinction  between  the  matching  and  the  optical  flow  formulations  parallels  the  difference 
between  long-range  and  short-range  motion  perception  processes  observed  in  psychophysical 
studies  [20]. 

The  matching  problem  is  more  difficult  in  motion  than  in  stereo,  for  example  because  of 
changing  illumination  and  because  of  moving  shadows,  whose  motion  differs  from  the  true 
motion  of  the  objects.  As  in  stereo,  a  smoothness  constraint,  equivalent  to  the  assumption 
that  the  velocities  of  different  features  change  slowly  over  the  image,  appears  plausible  and 
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is  used  by  many  algorithms  ([64],  [53],  [107]).  Typically,  computer  vision  algorithms  en¬ 
force  smoothness  of  velocity  either  over  regions  of  the  image  [64],  or  along  its  contours  [53]. 
Psychophysical  findings  suggest  that  both  ways  of  imposing  smoothness  may  coexist  in  the 
human  visual  system.  For  example,  the  perceived  motion  of  points  on  a  contour  is  similarly 
affected  by  the  motion  of  neighboring  points  and  of  more  distant  points  that  lie  on  the  same 
contour  [108]. 

Given  matched  images,  there  are  many  ways  to  recover  the  three-dimensional  structure 
of  the  objects.  Early  studies  of  this  problem  concentrated  on  the  issue  of  minimal  informa¬ 
tion  needed  to  recover  unique  structure  from  motion  ([152],  [31],  [123],  [104],  [150],  [165]). 
One  major  problem  common  to  the  algorithms  that  recover  structure  from  motion  is  their 
numerical  instability  [153].  To  decrease  the  resulting  sensitivity  to  errors,  motion  informa¬ 
tion  may  be  pooled  over  many  features  (extended  space;  see  [62],  [23]),  or  over  many  time 
frames  (extended  time;  see  [154]).  Human  vision  appears  to  use  both  extended  time  and 
space  analysis  ([71],  [55]).  Some  psychophysical  evidence  suggests  that,  in  disagreement  with 
most  structure  from  motion  algorithms,  the  perception  of  3D  structure  may  not  depend  on 
prior  perception  of  motion  [157]. 

It  is  still  an  open  question  to  what  extent  the  exact  structure  of  objects  is  necessary  to 
fulfill  the  ultimate  goals  of  vision.  Some  shape  information,  such  as  the  classification  of  object 
surfaces  as  convex,  concave,  planar,  or  hyperbolic  (saddle-like)  can  be  obtained  without  the 
complete  recovery  of  the  exact  locations  in  space  of  all  the  points  in  the  image  [166].  For 
specific  narrowly  defined  tasks,  such  as  obstacle  avoidance  in  navigation,  fin  even  simpler 
analysis  may  suffice  [110]. 

2.1.S  Edge  detection 

A  problem  common  to  the  design  of  all  low-level  vision  modules  is  the  choice  of  the  input 
representation.  The  raw  input  to  the  visual  system,  an  array  of  intensity  values  (see  figure  1), 
does  not  suit  well  tasks  that  require  image  matching,  for  example,  because  of  its  sensitivity 
to  noise  and  to  changes  in  the  illumination.  Intensity  edges  (image  locations  where  the 
intensity  changes  significantly)  have  been  proposed  as  a  more  stable  initial  representation 
for  stereo  and  motion.  Different  definitions  of  significant  change  in  intensity  lead  to  different 
algorithms  for  edge  detection  (see  [34],  [6]  and  [54]  for  reviews).  Finding  computationally 
efficient  algorithms  that  capture  those  edges  that  correspond  to  what  we  intuitively  perceive 
as  edges  proved  to  be  a  difficult  problem.  If  edge  detection  is  meant  to  provide  a  cartoon 
of  the  image,  i.e.,  the  set  of  edges  that  has  a  physical  origin,  it  effectively  subsumes  other 
difficult  problems  in  vision,  such  as  figure- ground  separation  and  object  recognition. 

Intensity  edges  may  be  defined  as  those  places  in  the  image  where  the  rate  of  change 
of  intensity  attains  a  local  maximum.  The  derivative  operation  included  in  this  definition 
makes  it  sensitive  to  noise.  Noise  simplification  can  be  reduced  by  smoothing  the  image  before 
subjecting  it  to  the  derivative  operation.  A  computationally  efficient,  biologically  plausible 
edge  detection  operator  based  on  this  approach  is  a  linear  filter  that  combines  smoothing 
(by  a  convolution  with  a  2D  Gaussian)  with  differentiation  (by  the  application  of  a  2D 
Laplacian,  a  rotationally  symmetric  operator).  Equivalently,  the  image  may  be  convolved 
with  the  Laplacian  of  a  Gaussian.  The  edges  are  then  found  by  locating  the  zero-crossings  in 
the  output  of  the  convolution  [93].  Psychophysical  evidence  [164]  suggests  that  zero-crossings 
may  be  involved  in  early  processing  in  human  vision,  along  with  additional  features,  such 
as  intensity  extrema.  (For  an  anatomical  model  of  zero-crossing  detection  in  the  retina  see 
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[133]). 

Since  the  Laplacian  of  Gaussian  operator  is  spatially  symmetric, -it  ignores  the  asymmetry 
of  edges,  which  are  one- dimensional  curves  with  a  preferred  direction.  Same  algorithms 
address  this  problem  by  computing  second  directional  derivatives  of  the  input  [52].  The 
computation  of  the  second  derivative  in  the  direction  of  the  intensity  gradient  has  been  shown 
optimal  for  the  detection  of  oriented  edges  [149].  As  in  stereo,  a  multi-resolution  approach 
(computing  the  edges  at  several  levels  of  smoothing)  proved  useful  in  edge  detection.  A  widely 
used  well-engineered  implementation  of  edge  detection  that  employs  both  approaches  is  due 
to  Canny  ([26];  see  figure  3).  Labeling  edges  according  to  their  physical  origin  (shadows, 
occluding  contours  etc.)  is  a  subject  of  current  research  (e.g.  [44]).  Edge  labeling  may  be 
useful  in  tasks  such  as  object  recognition. 


Figure  3:  Left:  an  image  of  a  natural  scene  (same  as  in  figure  1).  Bight:  intensity  edges 
extracted  from  this  image  by  the  Canny  algorithm  [26]. 


2.1.4  Shape  from  shading  and  shadows 

Shading  information  seems  to  be  quite  important  in  human  shape  perception  [63].  In  the 
perception  of  shape  from  shading,  the  visual  system  must  separate  intensity  changes  that 
are  caused  by  changing  orientation  of  object  surfaces  from  those  due  to  changing  surface 
reflectance  (including  color)  and  illumination.  As  an  example  of  the  perception  of  shape 
from  shading,  consider  the  human  face  in  figure  4.  The  basic  information  an  the  3D  shape  of 
the  face  in  this  image  is  obtained  from  shading  analysis  (e.g.,  the  nose  appears  to  protrude 
from  the  face  because  its  flanks  are  shaded  darker  than  its  tip). 

When  the  illumination  and  the  viewing  direction  are  fixed  and  the  color  is  constant,  the 
shading  of  a  surface  depends  solely  on  its  local  orientation.  When  posed  quantitatively,  the 
problem  of  inferring  shape  from  shading  turns  out  to  be  one  of  the  most  difficult  in  low-level 
vision.  The  main  complication  is  ambiguity:  a  shaded  image  can  be  interpreted  in  many 
different  ways.  For  example,  a  concave  surface,  a  convex  surface  and  a  saddle-like  surface  all 
appear  the  same  from  certain  viewpoints.  If  surface  color  is  unknown  and  variable,  computing 
shape  from  shading  becomes  even  more  difficult. 

Assuming  a  simplified  surface  reflectance  model,  and  given  some  a  priori  knowledge  about 
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Figure  4:  A  shaded  image  of  a  face.  The  basic  information  on  the  3D  shape  of  the  face  in 
this  image  is  obtained  from  shading  analysis  (e.g.,  the  nose  appears  to  protrude  from  the 
face  because  its  flanks  are  shaded  darker  them  its  tip). 


the  objects  in  the  image  (e.g.,  the  depth  along  the  outlines  of  the  objects),  the  exact  shape 
from  shading  problem  becomes  solvable,  though  still  computationally  difficult  ([170],  [69], 
[62]).  A  further  assumption  of  oblique  illumination  may  substantially  reduce  the  computa¬ 
tional  complexity  [115],  although  the  problem  of  self- shadowing,  neglected  by  most  algorithms 
including  [115],  becomes  significant.  In  such  cases,  special  shape  from  shadows  algorithms 
[138]  may  be  employed  to  infer  shape  from  the  distribution  of  shadows  in  the  image.  One 
complicating  factor,  mutual  illumination  (a  secondary  illumination  due  to  the  reflectance  of 
light  by  other  objects  in  the  scene)  is  neglected  by  all  of  the  above  algorithms.  Recent  work 
has  shown  that,  despite  the  simplified  conditions  assumed  by  most  of  the  shape  from  shading 
algorithms,  mutual  illumination  can  make  the  problem  more  ambiguous  than  had  been  pre¬ 
viously  appreciated  [43].  On  the  other  hand,  experiments  with  computer  graphics  systems 
indicate  that  simulating  mutual  illumination  is  important  for  the  creation  of  realistic-looking 
images. 

Recent  psychophysical  data  suggest  that  humans  use  shading  information  in  a  limited  way 
([10],  [147],  [25]).  People  seem  to  use  shading  information  to  build  qualitative  descriptions 
of  object  surface  (e.g.,  in  terms  of  convex  and  concave  regions),  rather  than  to  compute 
exact  shape  from  shading  (in  light  of  the  complexity  of  extracting  shape  from  shading,  these 
findings  may  not  come  as  a  surprise).  The  computation  of  qualitative  shape  from  shading 
may  be  facilitated  by  using  highlights  [17],  which  are  a  nuisance  in  most  exact  approaches 
([16],  [131]).  FinaJy,  humans  appear  to  employ  high-level  heuristics,  such  as  the  assumption 
that  the  scene  is  illuminated  from  above,  to  disambigi;  ate  the  shape  from  shading  problem 
[131]. 
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2.1.5  Color 

In  human  vision,  color  contributes  important  information  for  object  recognition,  although  its 
contribution  is  by  no  means  necessary  for  most  everyday  tasks.  The  human  visual  system 
exhibits  an  impressive  ability  to  infer  the  correct  color  of  objects  under  illumination  that  may 
vary  in  direction,  intensity  and  spectral  content.  Thus,  while  light  arriving  at  the  retina  from 
a  given  surface  patch  under  different  illuminations  may  have  different  spectral  compositions, 
the  patch  would  normally  appear  to  have  the  same  color  in  both  cases.  This  phenomenon 
is  called  color  constancy.  It  has  been  suggested  that  color  constancy  can  ensue  if  we  assume 
that  the  color  of  illuminant  changes  slowly  and  smoothly,  whereas  surface  color  changes 
abruptly.  Many  algorithms  rely  on  this  assumption  to  assess  the  true  color  of  the  surface 
([84],  [90];  [66]  show  that  a  linear  color  operator  can  be  learned  from  examples).  Under 
the  assumption  that  all  colors  appear  in  the  image  with  equal  probabilities,  the  color  of  the 
illuminant  can  be  estimated  by  averaging  color  over  large  neighborhoods  of  the  image  [83]. 
Another  approach  uses  the  color  of  highlights  to  compute  the  color  of  the  illuminant  [85]. 
Psychophysical  evidence  suggests,  however,  that  the  human  visual  system  does  not  use  this 
specific  technique  [67]. 

Humans  perceive  color  by  having  three  different  types  of  receptors  in  the  retina  whose 
peak  sensitivities  are  in  the  red,  green  and  blue  regions  of  the  spectrum.  Thus  the  dimension 
of  the  color  space  perceived  by  humans  is  only  two,  not  considering  the  overall  brightness. 
This  means  that  there  are  many  different  reflectance  functions  that  appear  to  humans  to 
have  the  same  color.  A  principal  component  analysis  of  the  distribution  of  surface  reflectance 
functions  of  natural  objects  and  of  ambient  daylight  reveals,  however,  that  the  number  of 
degrees  of  freedom  needed  is  actually  small.  More  specifically,  the  color  of  many  natural 
objects  can  be  largely  approximated  in  terms  of  only  three  base  functions  (see  [89],  [24]). 

2.1.6  Texture 

The  analyr  of  texture  deals  with  statistical  properties  of  collections  of  features  (textons 
or  elementary  units  of  texture  [74])  that  cover  object  surfaces.  Statistics  of  texture  may 
contain  important  perspective  information  and  may  enable  the  segmentation  of  the  image 
into  distinct  objects.  For  example,  in  the  image  of  a  tilted  uniformly  textured  plane  the 
density  of  elements  diminishes  in  the  direction  away  from  the  viewer,  providing  a  hint  to 
the  orientation  of  the  plane  in  depth.  For  non-planar  objects,  the  distribution  of  texture 
elements  is  more  complex:  a  sphere  sprayed  with  paint  projects  to  a  circle  filled  with  dots,  in 
which  the  density  of  the  dots  is  low  around  the  center  and  increases  towards  the  periphery. 

Differences  in  the  statistical  distribution  of  texture  primitives  may  help  segment  the 
image  into  distinct  objects  (see  [13]  for  a  review;  [160],  [28]).  If  the  analysis  is  based  on 
complex  texture  elements  [12],  it  is  first  necessary  to  identify  the  elements  in  a  hierarchical 
manner  and  then  compute  their  distribution  in  the  image.  The  characterization  of  useful 
textons  (in  particular,  those  involved  in  human  perception  of  texture)  has  been  a  subject  of 
extensive  research  [75].  One  relevant  question  is  that  of  the  relationship  between  texture- 
based  segmentation  and  the  structure  of  the  constituent  textons.  Another  theory  relies  on 
a  statistical  analysis  of  the  raw  intensity  values  in  the  image  [74].  Gradients  of  the  size 
distribution  and  the  density  of  textons  can  be  interpreted  as  depth  cues  for  receding  surfaces 
([46];  cf.  [141]),  through  Fcuv.tr  analysis  [5]  or  statistical  assumptions  about  the  scene  [169]. 
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2.1.7  Occluding  contours 

A  line  drawing  of  a  complex  3D  object,  containing  no  shading,  stereo  or  textural  cues,  may 
provide  information  sufficient  for  its  identification  and  for  the  perception  of  its  shape.  One 
major  source  of  information  in  a  line  drawing  is  the  shape  of  the  object’s  outline,  or  the 
occluding  contour.  Marr  [91]  argued  that  by  tnemselves  the  occluding  contours  are  not  suffi¬ 
cient  for  shape  classification.  He  therefore  concluded  that  our  ability  to  perceive  the  shape  of 
an  object  from  its  outline  is  achieved  through  the  imposition  of  strong  constraints  on  possible 
shapes  of  objects,  i.e.,  that  most  shapes  can  be  described  in  terms  of  volumetric  primitives 
called  generalized  cones  [15].  Koenderink  [80]  showed,  however,  that  such  restrictions  are  not 
necessary,  and  that  occluding  contours  carry  important  information  about  shape.  Specifi¬ 
cally,  he  showed  that  a  concave  segment  of  the  object’s  boundary  implies  that  the  object’s 
surface  is  locally  hyperbolic  (saddle-like),  while  a  convex  occluding  contour  implies  a  locally 
elliptic  (convex  or  concave)  surface.  It  should  be  noted  that  saddle-like  regions  are  good 
candidates  for  object  segmentation,  and  that  human  vision  appears  to  use  simple  heuristic 
rules  of  this  type  in  achieving  descriptions  of  objects  in  terms  of  their  parts  [61]. 

2.1.8  Low-level  vision:  an  interim  summary 

In  distinction  from  other  theoretical  approaches,  computational  study  of  vision  combines  an 
analysis  of  the  computational  strategies  employed  by  biological  systems  with  psychological 
and  physiological  investigations,  and  with  building  artificial  vision  systems.  Thinking  in 
terms  of  representations  and  their  transformations  and  subjecting  the  resulting  theories  to 
empirical  tests  proved  especially  fruitful  in  the  domain  of  low-level  vision,  or  the  bottom-up 
recovery  of  visual  properties  of  the  environment.  The  interaction  between  computational, 
biological  and  machine  studies  led  to  a  better  understanding  of  the  difficulties  involved  in 
low-level  vision,  and  to  a  reopening  of  basic  questions  concerning  the  output  representation, 
the  physical  constraints  and  the  modularity  of  low-level  processing.  These  issues  become  even 
more  important  at  the  higher  levels  of  visual  processing,  where  at  present  little  feedback  is 
available  from  empirical  investigations. 

2.2  Middle  vision 

Two  classes  of  visual  operations  appear  to  fall  outside  the  scope  of  both  the  input-driven  low- 
level  processes  and  the  high-level  goal-oriented  ones.  The  first  class  includes  processes  that 
operate  on  the  “primal  sketch”  representation  of  the  input,  obtained  in  the  first  stage  of  vision 
[92].  The  purpose  of  these  processes  is,  in  general,  the  completion  and  the  enhancement  of 
the  primal  sketch.  These  operations  sue  lateral  in  the  sense  that  they  do  not  necessarily  rely 
on  either  a  more  detailed  input  than  already  available,  or  on  higher-level  cues.  Examples 
of  phenomena  that  may  involve  such  processes  are  boundary  completion  (figure  5,  left), 
interpolation  of  depth  and  motion  information  to  regions  that  lie  in  between  features  used 
to  compute  this  information,  spatial  grouping  of  features  (figure  5,  right)  and  increased 
perceived  saliency  of  some  contours  relative  to  others  ([92],  [27],  [137]). 

Intermediate-level  processes  of  the  second  type  may  be  described  computationally  as  mul¬ 
tipurpose  visual  routines  [155],  invoked  at  need  by  higher-level  modules.  These  routines  may 
enhance  the  intermediate  representation  by  making  explicit  spatial  properties  and  relations 
such  as  contiguity  (of  a  contour)  and  insideness  (of  a  feature  with  respect  to  the  contour). 
This  type  of  information  is  useful  in  many  visual  tasks,  such  as  recognition  and  navigation, 
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Figure  5:  Left:  Kanisza’s  triangles,  an  example  of  boundary  completion.  Illusory  contours 
that  form  a  triangle  are  seen  to  occlude  three  black  disks.  Right:  an  example  of  hierarchical 
grouping  by  shape  similarity  and  physical  proximity.  At  the  highest  level,  the  small  shapes 
form  a  circle. 

but  is  too  abstract  and  resource-intensive  to  warrant  automatic  bottom-up  computation.  An 
additional  kind  of  routines  may  be  involved  in  the  computational  substrate  of  visual  atten¬ 
tion,  where  the  basic  operations  are  indexing  (marking  a  location  in  the  visual  field)  and  the 
shifting  of  the  processing  focus  to  the  marked  location  [155]. 

Grouping  of  edges  receives  increased  attention  in  recent  studies  of  middle  vision.  Lowe  [87] 
proposed  to  start  the  grouping  by  detecting  image  properties  that  are  likely  to  convey  useful 
information  about  the  3D  world.  Among  such  properties  are  collniearity  and  parallelism:  the 
chance  that  edges  that  appear  collinear  or  parallel  in  the  image  do  so  only  by  accident  (due 
to  the  particular  viewpoint)  is  small.  Another  example  of  a  middle  vision  operation  on  edges 
is  the  computation  of  the  saliency  of  curve  segments  by  a  simple  local  mechanism  [137].  In 
many  cases,  this  operation  can  isolate  perceptually  important  curves  in  a  noisy  edge  map 
(produced,  e.g.,  by  the  Canny  method).  Our  last  example,  which  resembles  visual  routines  in 
that  it  operates  “laterally”  on  the  intermediate  representations,  is  the  integration  of  different 
low-level  cues  (e.g.,  [120],  [3]). 

The  advances  in  the  computational  theory  of  integrated  low  and  intermediate  level  vi¬ 
sion  are  beginning  to  draw  the  attention  of  psychologists  and  physiologists.  Psychophysical 
evidence  indicating  that  the  human  visual  system  may  indeed  combine  different  cues  into  an 
integrated  representation  can  be  found  in  [27].  Some  of  the  recent  experiments  that  address 
the  issue  of  intermediate  level  vision  and  support  the  existence  of  boundary  tracing  and  in¬ 
dexing  as  visual  routines  are  described  in  [72]  find  [128].  An  example  of  a  neurophysiological 
study  of  attention  in  monkey  is  [105],  which  describes  the  modification  of  the  receptive  field 
of  an  extrastriate  cell  by  attention  shifts.  Cells  that  respond  to  abstract  visual  entities  such 
as  illusory  contours  were  found  in  the  striate  cortex  of  the  monkey  (159). 

2.3  Object  recognition 

An  intelligent  visual  system  is  expected  to  allow  its  host  to  navigate  through  the  environment 
and  possibly  to  manipulate  any  objects  present  in  it.  Keeping  a  visual  library  of  models  of 
potentially  important  objects  enables  such  a  system  to  recognize  an  object  and  to  select  an 
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appropriate  behavior  based  upon  its  past  experience.  Consequently,  computational  theories 
of  object  recognition  postulate  that  there  exist  in  the  visual  system  representations  of  fa¬ 
miliar  objects  and  scenes.  To  recognize  an  object,  the  system  compares  it  with  each  of  the 
stored  models,  or  templates  [112].  An  estimate  of  the  goodness  of  fit  between  an  object 
and  a  template  can  be  obtained,  e.g.,  by  a  correlation-like  operation,  in  which  the  two  are 
superimposed  and  the  proportion  of  pixels  that  agree  in  their  values  is  computed. 


Figure  6:  The  appearance  of  a  3D  object  can  depend  strongly  on  the  viewpoint.  The  image 
on  the  right  is  of  the  same  object  as  the  image  on  the  left,  rotated  in  depth  by  90°.  The  dif¬ 
ference  between  the  two  images  illustrates  the  difficulties  encountered  by  any  straightforward 
template  matching  approach  to  3D  object  recognition. 

Recognition  of  3D  objects  seen  from  arbitrary  viewpoints  is  difficult  because  an  object’s 
appearance  may  vary  considerably  depending  on  its  pose  relative  to  the  observer  (see  figure  6). 
Thus,  while  straightforward  template  matching  [2]  may  be  useful  in  the  recognition  of  2D 
objects  in  a  controlled  environment,  it  will  not  work  for  3D  object  recognition,  unless  a 
template  is  stored  for  each  view  that  will  ever  have  to  be  recognized.  Although  the  extent  to 
which  people  can  recognize  novel,  radically  different,  views  of  3D  objects  is  not  clear  ([134], 
[135]),  we  obviously  do  have  some  ability  to  generalize  recognition  to  novel  views.  This  ability 
is  termed  visual  object  constancy  (see  e.g.  [65]). 

Most  of  the  schemes  for  object  recognition  proposed  to  date  can  be  divided  into  three 
main  classes,  according  to  their  approach  to  the  problem  of  object  constancy  [156].  The 
first  approach  assumes  that  objects  have  certain  invariant  properties  that  are  common  to  all 
their  views  and  different  between  object  classes  (in  practice,  this  twofold  assumption  proved 
difficult,  if  not  impossible,  to  satisfy).  Under  this  approach,  objects  are  represented  by  vectors 
of  property  values,  or,  equivalently,  by  points  in  a  multidimensional  space.  Recognition  then 
becomes  a  problem  of  clustering  in  this  space  (e.g.  [36]). 

A  second  approach  to  object  recognition  relies  on  the  decomposition  of  objects  into  sim¬ 
ple  generic  constituents.  The  components  are  generic  in  the  sense  that  all  objects  can  be 
described  in  their  terms.  For  a  structural  approach  to  succeed,  the  components  must  be 
rapidly  and  reliably  detected  in  any  given  view  of  an  object,  otherwise  the  problem  of  com¬ 
ponent  identification  tends  to  become  as  complex  as  object  recognition  itself.  An  analogous 
situation  results  if  the  breakdown  of  objects  into  components  is  too  fine-grained,  e.g.  when 
the  components  are  individual  edge  elements  or  lines.  In  that  case,  component  detection  is 
relatively  easy,  but  their  relationships  become  complicated. 

Some  older  object  recognition  systems  (e.g.  [22]),  and  at  least  one  psychological  theory 
[14],  describe  objects  in  terms  of  elongated  volumes  called  generalized  cylinders  ([15],  [94]). 
The  use  of  generalized  cylinder  primitives  may  be  regarded  as  a  compromise  between  the 
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conflicting  requirements  of  generality  and  detectability  of  object  components.  The  description 
of  an  object  under  this  representation  scheme  is  said  to  be  object-centered,  in  the  sense  that 
it  involves  the  relations  among  the  object’s  parts  that  do  not  depend  on  the  viewer’s  position. 

Variants  of  the  decomposition  approach  that  use  contour-based  primitive  descriptions  are 
relatively  successful  in  certain  domains,  such  as  industrial  part  recognition.  The  two  major 
limitations  of  this  approach  are  that  in  many  cases  the  precise  shape  of  the  object  matters 
more  than  its  decomposition  into  parts,  and  that  many  objects  have  no  natural  structural 
description. 

A  third  major  approach  to  object  recognition,  which  may  be  termed  normalization  [114] 
or  alignment  [156],  addresses  the  requirement  for  precise  characterization  of  object  shapes 
by  employing  a  modified  form  of  template  matching.  This  approach  may  be  illustrated  using 
the  simple  example  of  the  recognition  of  a  2D  shape  such  as  the  letter  A,  whose  size  is  larger 
than  that  of  A’s  template  stored  in  the  system  [109].  In  this  case,  template  matching  can  be 
still  used,  provided  the  input  shape  is  first  “normalized”  by  scaling  it  down  to  the  standard 
size. 

The  normalization  approach  can  be  extended  to  the  domain  of  3D  object  recognition.  To 
recognize  a  rigid  3D  object,  one  may  first  normalize  its  appearance  by  undoing  the  trans¬ 
formations  (such  as  3D  rotation)  by  which  it  differs  from  a  stored  model.  Combined  with 
algorithms  that  compute  the  necessary  transformations,  the  normalization  method  has  been 
recently  applied  to  the  recognition  of  natural  objects  from  unconstrained  viewpoints  ([87], 
[156]).  As  the  main  idea  of  normalization  is  the  ultimate  use  of  template  matching,  3D 
recognition  schemes  based  on  normalization  typically  involve  3D  object-centered  templates 
or  models. 

State  of  the  art  object  recognition  systems  are  typically  based  either  on  sophisticated 
search  techniques  ([47],  [49]),  or  on  variants  of  the  normalization  approach  ([19],  [88],  [145], 
[68],  [82]).  These  systems  are  mainly  useful  in  tasks  such  as  industrial  part  recognition,  be¬ 
cause  of  their  reliance  on  strict  geometrical  models.  In  addition,  search-based  systems  tend 
to  perform  poorly  in  cluttered  environments,  where  segmenting  the  object  from  the  back¬ 
ground  is  especially  difficult.  A  partial  solution  to  the  first  shortcoming  has  been  offered  in 
the  form  of  parametrized  object  models.  For  example,  a  pair  of  scissors  can  be  recognized 
irrespective  of  the  angle  between  the  two  blades,  if  this  angle  is  considered  as  an  additional 
dimension  of  the  search  space,  along  with  the  viewpoint  parameters  (this,  of  course,  aggra¬ 
vates  the  complexity  of  the  search).  The  problem  of  sensitivity  to  clutter  and  noise  may 
in  principle  be  approached  through  the  use  of  distinctive  labels  for  recognition  primitives 
(edges,  comers  etc.).  It  is  not  clear,  however,  what  kind  of  labeling  would  be  easy  enough 
to  compute  and  at  the  same  time  informative  enough  to  be  useful  in  recognition.  Another 
possibility  here  is  to  use  middle  vision  techniques  (section  2.2)  to  enhance  the  image  before 
recognition  is  attempted.  An  additional  problem,  common  to  most  recognition  methods,  is 
that  of  indexing:  choosing  a  subset  of  models  that  is  likely  to  include  a  potential  match  to 
the  input,  rather  than  trying  to  match  the  input  to  all  known  models.  Finally,  we  remark 
that  automatic  acquisition  of  object  models  appears  to  be  the  main  difficulty  associated  with 
those  approaches  that  rely  on  3D  geometrical  representation. 

The  two  related  questions,  which  recognition  scheme  is  employed  by  the  human  visual 
system  and  what  is  the  nature  of  visual  object  representation,  are  at  present  unresolved  [65]. 
In  this  respect,  our  understanding  of  recognition  lags  considerably  behind  our  understanding 
of  low-level  tasks  such  as  motion  detection  find  stereopsis.  We  discuss  possible  reasons  for 
this  situation  in  section  3. 
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2.4  Biophysics  of  computation 

The  possibility  to  separate  abstract,  or  computational,  from  concrete,  or  implementational, 
aspects  of  visual  perception  is  a  central  feature  of  the  approach  to  the  study  of  vision  ad¬ 
vocated  by  Marr  and  Poggio  [96].  Experience  of  the  last  fifteen  years  suggests  that  this 
view  is  too  idealized,  and  that  the  complexity  of  most  problems  in  vision  renders  imprac¬ 
tical  algorithms  that  ignore  constraints  imposed  by  the  available  hardware.  Moreover,  the 
choice  of  low-level  visual  modules  for  which  a  computational  theory  is  sought  depends  on  the 
hardware.  For  example,  depth  can  be  recovered  either  through  binocular  stereo  or  through  a 
laser  rangefinder.  The  computational  problems  that  need  to  be  addressed  in  these  two  cases 
are  clearly  different. 

One  important  source  of  differences  between  solution  classes  that  are  available  to  bio¬ 
logical  vision  and  those  that  best  suit  machine  vision  is  the  relevant  computational  primi¬ 
tives.  Brains  process  large  amounts  of  low-precision  data  at  a  relatively  slow  rate  and  in  a 
highly  parallel  fashion,  while  most  digital  computers  are  serial,  fast  and  can  carry  out  high- 
precision  arithmetic  operations  (see,  e.g.,  [66],  [136]).  A  gradual  realization  of  the  importance 
of  hardware-related  constraints  led  to  an  increasing  cooperation  between  the  fields  of  neuro¬ 
science  and  computational  vision.  One  research  goal  in  this  field  is  to  find  out  “what  are  the 
biophysical  mechanisms  underlying  information  processing  and  how  are  these  mechanisms 
used  to  perform  specific  tasks”  ([78],  p.640). 

Although  the  first  detailed  models  of  a  neuron  were  proposed  in  the  fifties  [58],  compu¬ 
tational  biophysics  is  still  at  an  early  stage  of  development.  The  study  of  neuronal  function 
is  now  approached  at  different  levels,  from  that  of  the  biophysics  and  biochemistry  of  mem¬ 
branes  to  the  level  of  ensembles  of  neurons.  In  particular,  functional  understanding  is  sought 
through  anatomical  and  physiological  investigation,  tightly  coupled  with  computer  simula¬ 
tion  (see  [32],  [79]  for  recent  collections  of  papers  in  the  field).  Biochemical  and  electrical 
mechanisms  have  been  invoked  as  explanations  of  phenomena  such  as  retinal  adaptation  and 
spatial  and  temporal  filtering.  Models  that  involve  networks  of  neurons  with  simple  exci¬ 
tatory  and  inhibitory  connections  have  been  proposed,  among  other  computational  tasks  in 
vision,  for  directional  selectivity  in  the  retina  ([8],  [148]),  for  the  computation  of  a  smooth 
velocity  field  [53],  and  for  the  winner-take-all,  or  maximum,  operation  [173]. 

3  Case  studies 

From  the  preceding  section,  it  appears  that  at  the  low  levels  of  the  visual  processing  the 
developments  in  machine  vision  closely  paralleled  biological  and  psychological  findings.  In 
high  level  tasks  such  as  recognition,  most  machine  vision  approaches  bear  less  resemblance  to 
their  putative  counterparts  in  biological  systems.  We  identify  radical  differences  between  the 
nature  of  representation  at  the  low  and  high  levels  of  the  human  visual  system  as  a  possible 
cause  of  this  distinction.  We  support  this  view  by  an  analysis  of  three  characteristic  cases. 
The  first  case  is  the  measurement  of  visual  motion,  a  low-level  vision  process  whose  objective, 
the  computation  of  a  projected  2D  velocity  field,  is  well-defined  in  representational  terms  and 
has  support  in  biological  studies.  The  second  case  deals  with  the  integration  of  low-level  visual 
cues,  a  middle  vision  process  whose  biological  reality  has  not  yet  been  demonstrated.  The 
discussion  of  integration  illustrates  some  of  the  engineering  aspects  of  vision  research.  The 
third  case,  recognition  of  three-dimensional  objects,  is  an  example  of  a  field  still  in  search 
for  a  computationally  efficient,  biologically  plausible  representation  paradigm,  and  in  which 
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current  engineering  solutions  do  not  seem  to  converge  on  human-like  performance. 

S.l  The  measurement  of  visual  motion:  well-defined  representations  allow 
a  principled  solution 

A  standard  formulation  of  the  computation  of  visual  motion  distinguishes  between  two  pro¬ 
cessing  stages  [57].  In  the  first  stage,  the  movement  in  the  changing  2D  image  is  measured.  In 
the  second  stage,  motion  estimates  obtained  through  this  measurement  are  used  in  different 
ways,  e.g.,  in  navigation,  or  in  the  recovery  of  the  3D  layout  of  the  environment. 

The  problem  confronted  by  the  visual  system  in  the  first  stage  of  motion  perception 
may  be  posed  as  the  computation  of  a  projected  2D  velocity  field,  i.e.,  the  assignment  of  a 
velocity  vector  to  each  feature  in  the  image  (alternative,  more  qualitative  formulations  are 
conceivable,  but  will  not  be  discussed  here).  In  this  computation,  the  visual  system  must 
rely  solely  on  the  changes  in  the  light  intensity  patterns  projected  on  the  retina.  In  general, 
many  possible  movements  may  give  rise  to  the  same  changes  in  the  retinal  illumination. 
The  situation  is  further  aggravated  if  the  measurement  of  visual  motion  is  carried  out  by  a 
mechanism  which  examines  only  a  limited  area  of  the  image. 

The  ability  of  a  limited-area  motion  detection  mechanism  to  extract  only  partial  infor¬ 
mation  about  the  real  2D  velocity  field,  called  the  aperture  problem  ([161],  [98],  [64]),  may 
be  illustrated  by  the  following  example.  Consider  an  extended  oriented  pattern  in  the  image, 
such  as  an  intensity  edge,  moving  behind  a  relatively  small  aperture,  representing  the  limited 
area  of  the  image  analyzed  by  a  motion  detection  mechanism.  Because  of  the  aperture,  it 
is  only  possible  to  perceive  the  movement  of  the  edge  in  the  direction  perpendicular  to  its 
orientation  (figure  7).  Some  algorithms  that  incorporate  smoothing  (equivalent  to  using  in¬ 
formation  from  extended  portions  of  the  image)  can  solve  this  problem.  An  example  of  such 
an  algorithm  is  given  below.  It  should  be  noted  that  differential  formulations  of  velocity  field 
computation  that  yield  a  single  equation  per  image  point  (e.g.  [64])  are  inherently  sensitive 
to  the  aperture  problem.  A  complete  2D  velocity  may  be  recovered,  at  the  expense  of  an 
increased  sensitivity  to  noise,  by  writing  down  second-order  differential  equations  [158]. 

The  output  of  a  collection  of  limited-aperture  motion  detectors  must  be  further  processed 
to  recover  a  better  approximation  to  the  true  velocity  field.  According  to  Marr’s  approach, 
this  can  be  done  by  constraining  the  solution  to  the  recovery  problem  to  comply  with  prior 
assumptions  that  reflect  the  physical  nature  of  the  problem.  The  assumption  that  the  velocity 
field  must  be  smooth  ([98],  [64])  proved  to  be  a  good  compromise  between  physical  reality  and 
computational  convenience.  Hildreth  [53]  incorporated  this  assumption  into  the  measurement 
of  motion  by  formulating  the  computational  problem  as  constrained  minimisation.  The  true 
velocity  V  was  estimated  by  minimizing  an  error  functional  0  that  expressed  a  compromise 
between  the  requirement  of  smoothness  and  the  compliance  of  the  velocity  component  normal 
to  the  image  contours  with  the  measured  data: 

0  =  /  (ts1  ) +  (it? )  *W[V  * 

where  u-1  is  the  measured  velocity  in  the  direction  u1  perpendicular  to  the  contour,  0  is  a 
weighting  factor  that  expresses  the  confidence  in  the  measured  velocity  constraints,  and  the 
integral  is  taken  along  the  image  contours  (in  comparison,  other  formulations,  e.g.,  [64],  used 


16 


Figure  7:  An  illustration  of  the  aperture  problem.  Left:  a  bar  E  is  moving  behind  a  small 
aperture  A,  so  that  its  ends  are  not  visible.  Only  the  component  uj.  of  the  bar’s  velocity 
that  is  perpendicular  to  the  bar’s  orientation  can  be  measured  through  the  aperture.  Right: 
an  example  of  a  circle  translating  to  the  right.  The  component  perpendicular  to  the  circle’s 
contour,  which  would  be  computed  by  aperture-limited  operators,  does  not  correspond  to 
the  true  motion  of  the  circle. 

area-based  constraints).  The  first  term  in  the  above  expression  corresponds  to  an  explicit 
imposition  of  the  smoothness  constraint,  used  to  resolve  the  ambiguity  illustrated  in  Figure  7. 

Computing  the  maximally  smooth  velocity  field  consistent  with  the  available  evidence 
was  shown  by  Hildreth  [53]  to  model  a  number  of  aspects  of  the  human  motion  perception 
mechanism,  including  visual  illusions  in  which  the  algorithm  and  human  vision  fail  in  similar 
ways.  (Algorithms  that  use  areabased  formulations,  e.g.,  [172],  are  also  compatible  with 
psychophysical  data.)  Hildreth’s  algorithm  was  subsequently  shown  to  belong  to  a  general 
framework  for  approaching  the  so  called  inverse  problems,  of  which  the  reconstruction  of 
velocity  field  by  aperture-limited  detectors  is  a  special  case  [122].  Following  the  theoretical 
analysis  of  the  problem,  Movshon  et  al.  [106]  discovered  what  could  be  the  physiological 
correlates  of  the  two  successive  stages  of  motion  measurement:  cells  in  cortical  area  VI 
that  are  sensitive  to  the  direction  of  motion  in  such  a  way  that  they  can  provide  only  the 
component  perpendicular  to  image  contours,  and  cells  in  area  MT  that  appear  to  combine 
the  outputs  of  simpler  detectors  to  provide  the  true  direction  of  motion  in  2D. 

3.2  Integration  of  low-level  vision  cues:  refining  low-level  representations 

Biological  vision  systems  are  more  robust  and  flexible  than  any  existing  computer  vision 
system,  even  in  the  relatively  simple  domain  of  low-level  vision.  One  reason  for  this  may 
be  that  biological  vision  combines  information  from  different  low-level  cues  to  form  richer 
and  more  robust  intermediate  representations.  Since  little  is  known  about  the  nature  of 
intermediate  representation  in  human  vision,  attempts  to  integrate  low-level  modules  tend 
to  rely  on  engineering  common  sense  and  on  mathematical  tools  such  as  probability  theory. 

The  task  of  integrating  low-level  modules  in  computer  vision  encounters  difficulties  such 
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as  noisy  data  and  differences  between  the  output  representations  used  by  the  individual  mod¬ 
ules.  Most  algorithms  that  attempt  integration  tend  to  deal  only  with  modules  that  process 
related  cues  (e.g.,  stereo,  vergence  and  focus  [1]).  As  as  example  of  a  more  comprehensive 
integration  effort,  we  shall  describe  the  MIT  Vision  Machine,  a  system,  built  around  a  par¬ 
allel  supercomputer,  whose  purpose  is  the  integration  of  stereo,  motion,  color,  texture,  and 
intensity  edge  data  ([120],  [44]). 

The  two  basic  observations  behind  this  particular  approach  to  integration  are  (i)  that 
physical  edges,  such  as  surface  orientation  discontinuities  and  object  boundaries  in  the  scene, 
cause  the  appearance  of  discontinuities  in  the  output  of  at  least  some  of  the  low-level  visual 
modules,  and  (ii)  that  a  generalized  edge  map  would  constitute  a  highly  useful  representation 
of  a  scene  (consider,  as  an  illustration,  how  informative  can  a  mere  cartoon  of  a  scene  be). 
Thus,  combining  information  from  different  cues  along  edges,  instead  of  over  regions,  may 
result  in  a  computationally  tractable  and  useful  integrated  representation.  In  many  cases, 
the  physical  origin  of  the  edges  in  this  representation  may  also  be  deduced.  For  example,  if 
it  is  known  that  along  an  edge  there  is  no  depth  discontinuity  (i.e.,  the  stereo  disparities  are 
continuous)  and  no  color  discontinuity,  it  may  be  assumed  that  the  edge  is  due  to  a  shadow. 
A  physical  classification  of  the  edges  obtained  through  the  integration  of  different  cues  may 
later  be  used  in  recognition  and  navigation. 

Edge-based  integration  should  start  with  a  reliable  map  of  discontinuities  in  the  individual 
low-level  cues.  Since  the  output  of  most  low-level  modules  by  themselves  is  unreliable,  one 
should  choose  the  module  whose  performance  is  the  most  stable  and  consistent  and  use  its 
output  as  the  basis  for  the  integration  process.  A  good  candidate  for  the  basic  representation 
is  an  intensity  edge  map,  computed  by  the  Canny  algorithm  [26].  Assuming  that  some  of 
the  other  edges,  corresponding  to  discontinuities  in  depth,  color  and  texture,  are  also  found 
by  this  edge  detector,  one  can  improve  the  reliability  of  the  depth,  color  and  texture  maps 
by  forcing  discontinuities  in  these  maps  to  align  with  the  intensity  edges  found  by  the  more 
reliable  edge  detector. 

Since  edge  detection  leads  to  noise  amplification,  unless  the  process  includes  sufficient 
smoothing,  the  above  approach  to  integration  faces  a  dilemma.  If  the  smoothing  is  too 
strong,  many  of  the  discontinuities  necessary  for  integration  will  be  lost.  Stereo  disparities, 
for  example,  should  in  general  be  smoothed  almost  everywhere  in  the  image,  except  at  the 
boundaries  between  objects.  The  location  of  these  is,  of  course,  not  known  at  the  time 
the  disparity  map  is  computed,  that  is,  before  integration.  Different  techniques  have  been 
suggested  to  circumvent  this  problem,  by  performing  a  restricted  smoothing  ([144],  [18]). 
The  MIT  Vision  Machine  uses  for  this  purpose  a  statistical  method,  originally  developed 
for  image  reconstruction  ([45],  [99]).  This  method  employs  Markov  Random  Fields  (MRFs), 
a  mathematical  structure  that  specifies  the  probability  for  a  certain  parameter  to  attain  a 
specific  value  at  an  image  point  in  terms  of  the  values  of  the  parameter  at  the  neighboring 
points.  The  value  at  a  given  site  is  allowed  to  affect  its  neighbor’s  value  only  if  there  is  no 
discontinuity  between  the  two.  The  placement  of  the  discontinuities  in  this  formalism  is  in 
itself  statistical:  the  state  of  the  system  is  allowed  to  fluctuate,  with  edges  being  turned  on 
and  off,  until  a  stable  configuration  is  obtained.  This  configuration  is  usually  a  less  noisy 
version  of  the  original  image.  Since  the  MRF  technique  uses  an  explicit  discontinuity  map,  it 
appears  well-suited  for  an  edge-based  integration  system.  The  integration  algorithm  in  such 
a  system  may  include  the  following  steps: 

1.  Apply  different  low  level  vision  algorithms,  in  parallel. 


18 


2.  Smooth  the  output  of  each  module  with  an  MRF,  constraining  discontinuities  to  the 
locations  of  intensity  edges,  in  parallel. 

3.  Use  the  discontinuities  in  the  MRF  of  each  module  to  label  edges  according  to  their 
physical  origin,  i.e.,  occluding  boundaries,  corners,  shadow  edges,  specularity  edges, 
color  edges. 

4.  If  necessary,  return  to  step  2  and  use  the  labeled  edges  to  improve  the  quality  of  the 
output. 

An  example  of  the  output  of  this  system  using  real  images  is  given  in  figure  8. 
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Figure  8:  A  cartoon-like  representation  of  two  objects  segmented  from  the  background  using 
stereo,  motion,  color  and  texture  cues. 

At  present,  this  approach  still  falls  far  short  of  achieving  human-like  performance  in  scene 
segmentation  and  in  the  classification  of  image  edges.  The  major  open  problems  in  the  MRF- 
based  approach  to  integration  are  the  poor  initial  output  of  some  of  the  low-level  modules  and 
the  need  for  a  manual  tuning  of  the  MRF  parameters.  Furthermore,  it  is  not  clear  whether  a 
simple  and  qualitative  rule-based  approach  would  not  perform  better  than  formal  approach 
of  combining  quantitative  data. 

3.3  Object  recognition:  looking  for  the  right  representation 

It  is  difficult  to  find  in  the  study  of  high-level  vision  a  parallel  to  the  achievements  of  the 
combined  mathematical,  psychological  and  physiological  approach  that  advanced  our  under¬ 
standing  of  low-level  visual  tasks  such  as  motion  and  stereo.  In  low-level  vision  many  physical 
assumptions  have  been  found  useful  in  dealing  with  the  ill-posedness  of  inverse  optics  or  the 
reconstruction  problem.  In  comparison,  in  high-level  problems  such  as  object  recognition  rel¬ 
evant  physical  constraints  are  scarce  and  key  computational  issues,  in  particular  the  nature 
of  representation,  are  highly  controversial. 

Consider  the  task  of  recognizing  objects  that  belong  to  any  fixed  set,  such  as  a  collection 
of  typefaces.  The  major  problem  confronted  by  the  visual  system  in  this  case  is  that  of  noise. 
For  example,  machines  can  be  trained  to  recognize  complicated  and  diverse  2D  objects  such 
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as  printed  characters  that  come  in  a  variety  of  sizes  and  typefaces  [76],  with  an  efficiency 
approaching  that  of  a  human.  In  another  example  of  a  restricted  domain,  the  interpretation 
of  polyhedral  scenes  with  shadows,  relying  on  a  complete  enumeration  of  possible  types  of 
polyhedral  junctions,  has  been  demonstrated  by  Waltz  as  early  as  1975  [163].  In  both  these 
cases,  the  finite  domain  places  an  effective  constraint  on  the  possible  solutions. 

When  the  shape  of  the  objects  is  allowed  to  change  continuously,  but  in  a  principled 
manner  and  without  affecting  their  basically  discrete  and  finite  classification,  some  physical 
assumptions  still  apply  and  may  be  used  to  constrain  the  solution.  In  handwritten  character 
recognition,  for  example,  phenomenological  knowledge  of  the  physics  erf  motor  control  in 
handwriting  can  be  easily  formulated  in  mathematical  terms  and  leads  to  a  simplification  of 
the  recognition  process  [39].  In  comparison,  when  the  problem  is  to  recognize  an  arbitrary 
3D  shape  from  any  possible  viewpoint,  physical  constraints  can  be  applied  to  compensate  for 
only  one  source  of  image  variability  —  the  imaging  process.  The  other  part  of  the  problem, 
the  variability  of  the  intrinsic  3D  shape  of  the  objects,  remains  unsolved. 

In  many  recognition  problems,  the  nearest  conceivable  thing  that  resembles  a  physical 
constraint  seems  too  implausible  or  too  impractical  to  be  of  any  use.  Consider,  for  example, 
face  recognition,  an  important  visual  task  in  primates,  and  a  challenging  application  of 
machine  vision.  A  face  recognition  system  that  employs  geometric  models  is  bound  to  fail 
when  confronted  with  a  familiar  face  bearing  a  novel  expression,  even  if  a  great  number 
of  templates  corresponding  to  previously  encountered  expressions  are  stored.  Applying  the 
physical  constraint  method  in  this  case  amounts  to  the  inclusion  of  the  knowledge  of  facial  and 
cranial  anatomy  in  the  recognition  algorithm,  an  improbable  requirement.  It  just  might  turn 
out  that  a  feasible  face  recognition  system  would  have  to  rely  on  stored  examples  of  familiar 
faces  bearing  prototypical  expressions  more  than  on  an  algorithmic  approach  of  computing 
some  standard  face  representation  using  anatomical  constraints. 

Even  if  standard  representations,  or  models,  of  3D  objects  are  involved  in  recognition,  it 
is  not  clear  what  method  of  modeling  is  computationally  most  efficient,  and  what  method  (or 
methods)  is  employed  in  human  vision.  A  major  point  of  controversy  as  to  the  nature  of  3D 
object  representation  regards  the  question  of  whether  it  should  be  object-centered  (symmet¬ 
ric  with  respect  to  the  object  itself),  as  advocated  by  Marr,  or  viewer- centered  (dependent 
upon  the  viewer’s  position  relative  to  the  object).  Although  in  principle  a  3D  object-centered 
representation  can  be  readily  constructed  from  a  collection  of  2|D  viewer-centered  ones,  in 
practice  building  a  3D  object- centered  model  only  to  use  it  later  by  comparing  its  2D  pro¬ 
jections  with  input  images  ([87],  [145],  [156])  seems  to  be  redundant.  Similarly,  it  appears 
equally  unlikely  that  the  human  visual  system  goes  through  the  effort  of  constructing  com¬ 
putationally  unwarranted  3D  models,  or,  alternatively,  that  it  necessarily  reconstructs  the 
third  dimension  of  an  input  object  before  recognizing  it  (after  all,  we  do  recognize  everyday 
objects  in  visually  impoverished  line  drawings). 

Humans  construct  their  library  of  object  models  through  experience.  It  appears  that  3D 
machine  recognition  systems  that  use  object-centered  models  fail  to  match  this  crucial  com¬ 
ponent  of  the  human  performance  in  recognition.  The  few  3D  machine  recognition  systems 
that  are  designed  to  learn  object  representations  from  examples  ([70],  [151],  [119])  employ 
viewer-based  rather  than  object- based  representation.  Some  evidence  to  the  effect  that  the 
ultimate  representation  need  not  be  object- centered  is  already  available.  This  evidence  falls 
into  three  classes,  psychophysical,  physiological  and  computational,  discussed  separately  be¬ 
low. 
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3.3.1  Hints  from  psychophysics 

The  most  straightforward  way  to  investigate  the  nature  of  object  representation  in  long-term 
memory  is  to  test  it  in  a  recognition  task.  An  object-centered  representation  that  does  not 
allow  the  system  to  infer  what  the  object  would  look  like  from  an  arbitrary  viewpoint  is,  for  all 
practical  purposes,  equivalent  to  a  collection  of  viewer- centered  view-specific  representations. 
Consequently,  if  the  human  visual  system  effectively  employs  object- centered  representations, 
a  person  should  be  able  to  recognize  an  object,  previously  seen  from  a  limited  range  of 
viewpoints,  from  a  novel  viewpoint. 

It  should  be  noted  that  the  structure  from  motion  theorems  ([152],  [150],  [86])  indicate 
that  in  principle  the  effect  of  having  a  full  object-centered  description  of  a  3D  rigid  body 
may  be  produced  by  actually  storing  only  a  few  of  its  2D  views.  The  ability  of  the  human 
visual  system  to  infer  the  3D  structure  of  an  object  from  a  small  number  of  its  projections, 
termed  the  kinetic  depth  effect,  has  been  known  for  a  long  time  ([162],  [71]).  An  alternative 
way  to  formulate  the  object- centered  representation  question  is,  therefore,  to  ask  whether 
the  3D  structure  perceived  in  the  kinetic  depth  effect  is  retained  by  the  long-term  memory, 
or  whether  this  effect  is  a  mere  by-product  of  some  other,  more  basic,  perceptual  operation. 

Until  recently,  the  ability  of  the  human  visual  system  to  recognize  objects  from  novel 
viewpoints  has  been  taken  for  granted.  Palmer,  Rosch  and  Chase  were  the  first  to  demon¬ 
strate  that  even  for  familiar  objects  speed  and  accuracy  of  recognition  varies  with  viewpoint 
[113].  A  more  direct  test  would  involve  novel  objects,  shown  to  the  subjects  under  con¬ 
trolled  conditions  which  guarantee  that  some  views  remain  unseen  until  the  test  time.  Such 
experiments,  involving  novel  3D  wire-like  stimuli,  have  been  carried  out  by  Rock  and  his  col¬ 
laborators  ([134],  [135]).  In  one  series  of  experiments  [134],  subjects  saw  novel  3D  wires,  each 
from  a  single  fixed  viewpoint.  The  subjects  were  subsequently  shown  other,  similar,  objects, 
one  at  a  time,  and  required  to  decide  whether  these  were  the  familiar  wires,  displayed  at  new 
attitudes,  or  unfamiliar  wires.  In  these  experiments,  the  subjects’  performance  approached 
chance  level  when  the  in-depth  rotational  distance  between  training  and  test  views  was  about 
30°.  In  another  experiment  [135],  the  test  called  for  deciding  whether  one  of  two  simultane¬ 
ously  displayed  wire  objects  was  a  replica  of  the  other  object,  shown  at  a  different  attitude. 
Again,  subjects  performed  poorly,  even  when  given  the  opportunity  to  reason  explicitly  about 
the  relative  positions  of  different  features  of  the  two  objects.  As  during  training  the  subjects 
perceived  the  stimuli  in  3D  (due  to  binocular  stereopsis),  the  lack  of  generalization  to  novel 
views  in  this  experiment  could  be  attributed  to  the  subjects’  failure  either  to  retain  3D  infor¬ 
mation,  or  to  translate  it  into  a  format  that  would  permit  later  generalization,  e.g.,  into  an 
object-centered  description.  The  likelihood  of  the  first  interpretation,  namely,  that  subjects 
do  not  include  3D  information  in  a  long-term  memory  representation  of  wire-like  objects, 
is  diminished  by  the  finding  that  the  presence  of  binocular  stereo  cues  significantly  reduces 
recognition  error  rate  ([38];  see  also  [65],  p.81). 

Additional  evidence  in  favor  of  the  hypothesis  that  objects  are  normally  represented  in 
long-term  memory  by  collections  of  2  jD  viewpoint-specific  descriptions  comes  from  the  ex¬ 
periments  of  Tarr  and  Pinker  [143],  who  found  that  naming  time  for  3D  objects  similar  to 
the  ones  used  by  Shepard  and  Metzler  [139]  increased  linearly  with  the  distance  between 
the  presented  view  and  the  closest  learned  view.  This  linear  dependence,  present  in  a  wide 
variety  of  recognition  tasks  (cf.  [113],  [37]),  is  predicted  by  a  theory  that  posits  multiple- view 
representations,  but  is  difficult  to  explain  within  the  framework  of  a  theory  of  recognition  of 
the  viewpoint  normalization  variety  ([87],  [156]).  For  example,  Ullman’s  alignment  scheme 
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([156];  see  section  2)  specifies  an  algorithm  that  computes  the  hypothesised  viewpoint  of  an 
object  model,  given  the  positions  of  a  few  key  model  features  in  the  input  image.  When 
combined  with  a  3D  object- centered  representation  of  the  model,  this  scheme  should  result 
in  recognition  time  that  is  independent  of  the  viewpoint  (unless  implementational  constraints 
bring  about  such  dependence,  in  which  case  the  representation  effectively  ceases  to  be  sym¬ 
metric  with  respect  to  the  object,  i.e.  object-centered). 

3.3.2  Hints  from  physiology 

The  notion  that  the  primate  visual  system  employs  view-specific  representations  of  objects 
is  supported  by  the  findings  of  Gross,  Perrett  and  others  ([51],  [117],  [50],  [116])  that  cells 
in  the  visual  area  IT  in  monkey  respond  preferentially  to  complex  stimuli  such  as  hands  and 
faces.  Perrett  et  al.  [116]  reported  that  some  cells  responded  maximally  to  full  views  of  a 
face,  others  to  the  face  tilted  upwards  or  downwards  by  45°,  and  still  others  to  profile  and 
back  views  of  the  head.  The  cells’  response  remained  strong  when  the  preferred  view  of 
the  face  was  displayed  at  different  scales,  or  rotated  within  the  image  plane  by  as  much  as 
90°.  If  indeed  these  cells  are  at  the  top  of  the  representation  hierarchy  in  the  visual  system, 
then  their  response  pattern  is  most  easily  explained  in  terms  of  viewer- centered  rather  than 
object-centered  representations. 

3.3.3  Computational  considerations 

Recent  computational  studies  provide  further  support  for  the  hypothesis  that  a  recognition 
strategy  based  on  multiple-view  representation  can  in  principle  account  for  much  of  the 
human  performance  in  object  recognition  ([40],  [11],  [119]).  In  particular,  if  the  recognition 
problem  is  formulated  in  terms  of  the  approximation  of  a  mapping  that  associates  a  standard 
view  of  an  object  with  any  of  its  other  views,  powerful  mathematical  tools  from  function 
approximation  theory  are  available  that  can  construct  such  a  mapping  from  examples  [121]. 
In  other  words,  a  system  can  learn  to  recognize  an  object  from  any  viewpoint  merely  by 
being  exposed  to  a  random  set  of  a  few  tens  of  the  object’s  views. 

4  Discussion  and  prognosis 

During  the  last  decade,  the  combined  computational  and  biological  study  of  vision,  pioneered 
by  Marr,  has  resulted  in  progress  in  some  areas  of  vision  research,  notably,  in  understanding 
low-level  vision.  The  computational  approach  proved  most  fruitful  when  the  problem  at  hand 
could  be  formulated  as  a  transformation  between  clearly  defined  representations.  In  other 
words,  principled  computational  solutions  were  first  offered  to  those  problems  for  which  a 
notion  of  competence  (in  Chomsky’s  sense;  see  [93],  p.28)  was  readily  available.  One  such 
problem  was  binocular  stereo,  for  which  both  the  input  representation  (edge  maps  of  the  two 
images)  and  the  output  representation  (a  dense  depth  map)  seemed  to  be  obvious. 

The  computational  paradigm  made  an  important  contribution  to  the  study  of  vision  by 
encouraging  the  generation  of  concrete  and  testable  theories  of  performance.  In  many  cases, 
these  theories  eventually  led  to  a  revision  of  basic  assumptions  of  the  underlying  theory 
of  competence.  For  example,  when  even  the  sophisticated  stereo  algorithms  appeared  to 
be  incomplete  as  models  of  human  performance,  the  nature  of  the  input  and  the  output 
representations  in  stereo  came  under  questioning. 
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The  computational  approach  has  also  increased  the  understanding  of  the  difficulties  as¬ 
sociated  with  high-level  tasks,  such  as  object  recognition.  Despite  early  enthusiasm,  it  now 
becomes  in  erects  ingly  clear  that  no  adequate  competence-level  theory  is  yet  available  for  these 
tasks.  In  particular,  computational  feasibility  and  biological  (psychological  and  neurophysi¬ 
ological)  plausibility  of  3D  object-centered  models  as  the  ultimate  representations  in  object 
recognition  is  now  being  questioned.  Prom  the  preceding  review,  it  appears  that  a  satis¬ 
factory  reconstruction  of  the  visual  world  is  not  feasible,  unless  its  aim  is  a  viewer-relative 
and  qualitative  (as  opposed  to  absolute  and  quantitative)  representation.  Fortunately,  it  also 
appears  that  carrying  out  an  ideal  reconstruction  is  not  a  prerequisite  for  seeing,  insofar  as 
it  does  not  seem  that  biological  vision  relies  on  such  reconstruction. 

The  roots  of  the  present  situation  in  computational  vision  may  be  traced  to  the  philo¬ 
sophical  foundations  of  the  currently  accepted  computational  paradigm  in  cognitive  science. 
An  important  part  of  Marr’s  argument  in  favor  of  the  computational  approach  to  vision  has 
been  functionalist.  The  functionalist  program  in  the  study  of  mind  [125]  has  been  prompted 
by  the  “suspicion  that  there  are  empirical  generalizations  about  mental  states  that  cannot 
be  formulated  in  the  vocabulary  of  neurological  or  physical  theories”  ([42],  p.25).  Glossing 
over  technical  details,  a  common  version  of  the  functionalist  approach  ([126],  [127])  holds 
that  two  intelligent  agents  are  in  the  same  psychological  state  (e.g.,  both  believe  that  they 
see  a  cat)  if  they  are  in  the  same  computational  state.  In  this  sense,  the  two  agents  function 
similarly,  although  their  physical  realizations  may  be  different.  Similarly,  Marr  and  Poggio 
suggested  that  functional,  or  computational,  understanding  of  vision  may  be  achieved  in 
separation  from  the  understanding  of  its  physical  substrate,  just  as  it  is  sufficient  to  study 
aerodynamics  to  understand  flight  in  both  birds  and  airplanes  ([93],  p.345). 

The  advantages  (and  the  feasibility)  of  the  postulate  of  level  separation  as  a  research 
strategy  in  vision  seem  now  less  substantial  than  a  few  yews  ago.  At  the  same  time,  func¬ 
tionalist  theories  of  cognition  are  increasingly  criticized  from  at  least  two  points  of  view.  One 
line  of  argument  starts  with  the  claim  that  the  functional  level  is  not  sufficiently  abstract 
to  allow  interesting  generalizations  about  mental  states  ([127],  p.74).  Others  claim  that  any 
such  generalizations  are,  in  principle,  unwarranted  ([33],  p.224;  see  also  [129]),  or  depend  on 
a  better  understanding  of  the  physical  substrate  of  cognition  [30].  A  speedy  resolution  of  the 
uncertainty  concerning  the  philosophical  foundations  of  the  functionalist  program  appears  at 
the  present  time  unlikely. 

Whereas  a  discussion  of  level  separation  in  computational  vision  leads  one  to  question  the 
basic  premises  of  the  functionalist  program,  another  problematic  issue  mentioned  above,  that 
of  visual  reconstruction  as  a  goal  of  vision,  can  be  addressed  within  the  present  paradigm. 
Persistent  difficulties  with  solving  the  inverse  optics  problem  (reconstructing  the  geometry 
of  visible  objects)  prompt  the  search  for  a  different  competence  theory  of  vision,  one  that  is 
formulated  in  terms  of  more  plausible  representations  than  geometric  models.  Many  of  the 
emerging  research  directions  can  be  described  as  sin  attempt  to  make  the  most  of  <,ne  rich  in¬ 
termediate  representation  provided  by  the  low-level  modules  (see  the  preface  to  [132]).  These 
approaches  focus  on  qualitative  rather  than  quantitative  representations  (e.g.  [146],  [168]), 
on  active  vision  ([110],  [7]),  on  the  application  of  high-level  rules  to  obtain  fast  approximate 
solutions  [131]  and  on  massive  use  of  parallelism,  memory  and  learning  ([120],  [151],  [119]). 
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